110 ◾ Bioinformatics
in this case, can either lead to no change in the structure and function of the protein
(conservative mutation) or lead to a deleterious consequence if the change alters the
protein structure and function. The base substitution may also have a nonsense
consequence if it results in a stop codon that truncates the translated protein, leaving
an incomplete and nonfunctional protein.
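The effect of a base substitution on a single codon can be illustrated with a short Python sketch. It builds the standard genetic code and reports whether a substitution leaves the amino acid unchanged, changes it, or introduces a stop codon. The function name `classify_substitution` and the labels it returns are assumptions made for this illustration. The sketch looks only at the encoded amino acid, so it cannot tell whether a changed amino acid is conservative or deleterious for the protein.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering of bases.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base substitution within one DNA codon.

    Hypothetical helper for illustration only. `position` is 0-based within
    the codon. Returns "silent", "missense", or "nonsense". Changes that
    remove an existing stop codon are not treated separately here.
    """
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    before = CODON_TABLE[codon]
    after = CODON_TABLE[mutated]
    if after == before:
        return "silent"      # same amino acid (or still a stop codon)
    if after == "*":
        return "nonsense"    # new stop codon truncates the protein
    return "missense"        # a different amino acid is encoded


if __name__ == "__main__":
    print(classify_substitution("GAA", 2, "G"))  # GAA -> GAG
    print(classify_substitution("GAG", 1, "T"))  # GAG -> GTG
    print(classify_substitution("TAC", 2, "A"))  # TAC -> TAA
```

In this sketch, a stop codon is represented by `*` in the codon table, which is a common convention rather than anything required by the text.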
A deletion mutation is the removal of one or more nucleotide pairs from a gene, which
may result in a frameshift, a garbled message, and a nonfunctional product. A deletion
may or may not be deleterious, depending on the part of the gene it alters and its impact
on the protein sequence. An insertion mutation is the addition of extra base pairs; it
may cause a frameshift when the number of inserted base pairs is not a multiple of three.
Mutations may also combine insertions and deletions, leading to a variety of outcomes.
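The multiple-of-three rule can be checked with a few lines of Python. This is a minimal sketch that assumes the variant lies entirely within a coding sequence and is described by a reference allele and an alternate allele; the function name `indel_effect` and its return labels are hypothetical.

```python
def indel_effect(ref: str, alt: str) -> str:
    """Report whether an insertion or deletion preserves the reading frame.

    Assumes the change falls inside a coding sequence. `ref` and `alt` are
    the reference and alternate alleles as nucleotide strings.
    """
    length_change = len(alt) - len(ref)
    if length_change == 0:
        return "no length change"
    kind = "insertion" if length_change > 0 else "deletion"
    # The reading frame is preserved only when the net number of bases
    # gained or lost is a multiple of three.
    frame = "in-frame" if length_change % 3 == 0 else "frameshift"
    return f"{frame} {kind}"


if __name__ == "__main__":
    print(indel_effect("A", "ATTT"))  # three bases inserted
    print(indel_effect("ACG", "A"))   # two bases deleted
    print(indel_effect("A", "AT"))    # one base inserted
```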
In general, a gene variant is a permanent change in the nucleotide sequence of a gene.
Variants are either germline variants, which occur in the eggs or sperm of parents and
are passed to offspring, or somatic variants, which are present only in specific cells
and are generally not hereditary.
In terms of sequence change, variants can be classified as single-nucleotide variants
(SNVs), insertions–deletions (InDels), or structural variations (SVs). An SNV is the
substitution of a single nucleotide for another. It is known as a single-nucleotide
polymorphism (SNP) if its allelic frequency in a population is more than 1%. An InDel is
an insertion and/or deletion of nucleotides in genomic DNA and includes events less than
1000 nucleotides in length. InDels are implicated as the driving mechanism underlying
many diseases. An SV involves a change of more than 50 base pairs in the sequence of a
gene; the change may be a rearrangement of part of the genome, a deletion, duplication,
insertion, inversion, translocation, or a combination of these. A copy number variation
(CNV) is a duplication or deletion that changes the number of copies of a particular DNA
segment within the genome. SVs have been implicated in a number of health conditions.
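These categories can be expressed as a small Python sketch that labels a variant from its reference and alternate alleles. The size ranges quoted for InDels and SVs overlap, so the sketch assumes a single cutoff of 50 base pairs, exposed as the parameter `sv_min_length`; that choice, the function names, and the label for equal-length multi-base changes are assumptions made for illustration, not definitions from the text.

```python
def classify_variant(ref: str, alt: str, sv_min_length: int = 50) -> str:
    """Label a variant by comparing reference and alternate alleles.

    Hypothetical helper. `sv_min_length` is an assumed cutoff: length
    changes larger than this are labeled "SV", smaller ones "InDel".
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    length_change = abs(len(alt) - len(ref))
    if length_change > sv_min_length:
        return "SV"
    if length_change > 0:
        return "InDel"
    # Same length, more than one base: a category not discussed in the text.
    return "multi-nucleotide substitution"


def is_snp(allele_frequency: float, threshold: float = 0.01) -> bool:
    """An SNV is called an SNP when its population frequency exceeds 1%."""
    return allele_frequency > threshold


if __name__ == "__main__":
    print(classify_variant("A", "G"))             # single-base substitution
    print(classify_variant("A", "ATG"))           # two bases inserted
    print(classify_variant("A", "A" + "T" * 60))  # sixty bases inserted
    print(is_snp(0.05), is_snp(0.001))
```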
In this chapter, we will learn about the major steps in variant identification and
analysis, including variant representation, the variant calling workflow, and variant
annotation. The process of identifying variants from sequence data (reads) is called
variant calling, which is the central topic of this chapter.
4.1.1 VCF File Format
Because a variant is a change at a specific location in a genome, bioinformatics requires
a format that can describe the type of mutation and its position in genome coordinates.
The variant call format (VCF) file [2] was therefore developed to hold information on a
large number of variants, together with the genotype information of multiple samples at
the same position. The VCF file, as shown in Figure 4.1, consists of (i) a metadata
section for the meta-information and (ii) a data section for variant data. The VCF file
has become the standard file for storing variant information for almost all variant
calling programs.
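The two-section layout can be seen by splitting a small VCF text in Python. The sketch below relies on two conventions of the VCF specification: meta-information lines start with "##", and a single tab-separated column header line starts with "#CHROM". The example records, the `##source` value, and the function name `split_vcf` are invented for illustration and do not come from a real dataset.

```python
# A tiny, made-up VCF used only to illustrate the layout.
EXAMPLE_VCF = """\
##fileformat=VCFv4.2
##source=hypothetical_caller
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t10177\t.\tA\tAC\t50\tPASS\t.
chr1\t10352\t.\tT\tTA\t43\tPASS\t.
"""


def split_vcf(text: str):
    """Split VCF text into metadata lines, column names, and data records."""
    metadata, columns, records = [], [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("##"):
            metadata.append(line)  # meta-information line
        elif line.startswith("#"):
            columns = line.lstrip("#").split("\t")  # column header line
        else:
            # Data line: pair each tab-separated field with its column name.
            records.append(dict(zip(columns, line.split("\t"))))
    return metadata, columns, records


if __name__ == "__main__":
    metadata, columns, records = split_vcf(EXAMPLE_VCF)
    print(len(metadata), "metadata lines")
    print(columns)
    for record in records:
        print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```

All field values are kept as strings here; a real parser would also convert positions and qualities to numbers and interpret the INFO and per-sample genotype columns.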
Each line in the metadata section of a VCF file begins with “##”. The metadata lines
describe the format and content of a VCF file. This can include information about the